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Abstract 



Long-range migrations and the resulting admixture between populations have been an important 
force shaping human genetic diversity. Most existing methods for detecting and reconstructing his- 
torical admixture events are based on allele frequency divergences or patterns of ancestry segments 
in chromosomes of admixed individuals. An emerging new approach harnesses the exponential de- 
cay of admixture-induced linkage disequilibrium (LD) as a function of genetic distance. Here, we 
comprehensively develop LD-based inference into a versatile tool for investigating admixture. We 
present a new weighted LD statistic that can be used to infer mixture proportions as well as dates 
with fewer constraints on reference populations than previous methods. We define an LD-based 
three-population test for admixture and identify scenarios in which it can detect admixture that 
previous formal tests cannot. We further show that we can discover phylogenetic relationships 
between populations by comparing weighted LD curves obained using a suite of references. Fi- 
nally, we describe several improvements to the computation and fitting of weighted LD curves that 
greatly increase the robustness and speed of the computation. We implement all of these advances 
in a software package, ALDER, which we validate in simulations and apply to test for admixture 
among all populations from the Human Genome Diversity Project (HGDP), highlighting insights 
into the admixture history of Central African Pygmies, Sardinians, and Japanese. 
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INTRODUCTION 

Admixture between previously diverged populations has been a common feature of human evo- 
lution, from at least the early periods of modern Homo sapiens to the present day, and has left 



significant genetic traces in contemporary populations (Li et al. 2008, WALL et al. 2009 REICH 



etal. 2009, 


Green et al. 2010, 


Gravel et al. 2011, 


PUGACH et al. 2011, 


Patterson et al. 



2012 ). Resulting patterns of variation can provide information about migrations, demographic his- 
tories, and natural selection, and can also be a valuable tool for association mapping of disease 
genes in admixed populations ( |PATTERSON et al. 2004| ). 

Recently, a variety of methods have been developed to harness large-scale genotype data to 
infer admixture events in the history of sampled populations, as well as to estimate a range of gene 
flow parameters, including ages, proportions, and sources. Some of the most popular approaches, 
such as STRUCTURE ( |PRITCHARD et al. 2000] ) and principal component analysis (PCA) ( [PAT- 



TERSON et al. 2006] ), use clustering algorithms to identify admixed populations as intermediates 
in relation to surrogate ancestral populations. In a somewhat similar vein, local ancestry infer- 



ence methods (Tang et al. 2006; Sankararaman et al. 2008, Price et al. 2009; Lawson 



et al. 2012] ) analyze chromosomes of admixed individuals seeking to recover continuous blocks 
inherited directly from each ancestral population. Because recombination breaks down ancestry 
tracts through successive generations, the time of admixture can be inferred from the tract length 



distribution (POOL and NIELSEN 2009, PUGACH et al. 2011, GRAVEL 2012), with the caveat 



that accurate local ancestry inference becomes difficult as tract lengths decrease and the reference 
populations diverge from the true admixing populations. 

A third class of methods makes use of allele frequency differentiation among populations to 
deduce the presence of admixture and estimate parameters, either with likelihood-based models 
( IChikhi et al. 2001t [Wang 2003| [Sousa et al. 20091 |Wall et al. 2009| |Laval et al. 2010 



GRAVEL et al. 201 1 ) or with phylogenetic trees built by averaging moments of the site frequency 



spectrum over large sets of SNPs (REICH et al. 2009, GREEN etal. 2010 PATTERSON etal. 2012 



PlCKRELL and PRITCHARD 2012 LlPSON et al. 2012). For example, /-statistic-based three- and 



four-population tests for admixture (Reich et al. 20Q9\ |Green et al. 2010, Patterson et al 



2012) are highly sensitive in the proper parameter regimes and when the set of sampled popula- 



tions sufficiently represents the phylogeny. One disadvantage of drift-based statistics, however, is 
that because the rate of genetic drift depends on population size, these methods do not allow for 
inference of the time that has elapsed since admixture events. 

Finally, a fourth approach recently proposed by MOORJANI et al. (201 \) that we further de- 



velop in this article uses associations between pairs of loci to make inference about admixture. 
In general, linkage disequilibrium (LD) in a population can be generated by selection, genetic 
drift, or population structure, and it is eroded by recombination. Within a homogeneous popu- 
lation, steady-state neutral LD is maintained by the balance of drift and recombination, typically 



becoming negligible in humans at distances of more than a few hundred kilobases (REICH et al. 



2001 The International HapMap Consortium 2007 ). Even if a population is currently 



well-mixed, however, it can retain longer-range admixture LD (ALD) from admixture events in 
its history involving previously separated populations. ALD is caused by associations between 
nearby loci co-inherited on an intact chromosomal block from one of the ancestral mixing popula- 



tions (Chakraborty and Weiss 1988). Recombination breaks down these associations, leaving 
a signature of the time elapsed since admixture that can be probed by aggregating pairwise LD 
measurements through an appropriate weighting scheme; the resulting weighted LD curve (as a 
function of genetic distance) then exhibits an exponential decay with rate constant giving the age 
of admixture ( |MOQRJANl etal. 20lT]|PATTERSON etal. 2012[ ). This approach to admixture dating 



is similar in spirit to local ancestry-based admixture dating, but LD statistics have the advantage 
of a simple mathematical form that facilitates error analysis. 

In this paper, we comprehensively develop LD-based admixture inference, extending the method- 
ology to several novel applications that constitute a versatile tool for investigating admixture. We 
first propose a cleaner functional form of the underlying weighted LD statistic and provide a precise 
mathematical development of its properties. As an immediate result of this theory, we observe that 
our new weighted LD statistic can be used to infer mixture proportions as well as dates, extending 



the results of |PlCKRELL etal. (20 12). Moreover, such inference can still be performed (albeit with 
reduced power) when data are available from only the admixed population and one surrogate an- 
cestral population, whereas all previous techniques require at least two such reference populations. 
As a second application, we present an LD-based three -population test for admixture with sensitiv- 
ity complementary to the 3-population /-statistic test ( |REICH et al. 2009^ [PATTERSON et al. 2012] ) 



and characterize the scenarios in which each is advantageous. We further show that phylogenetic 
relationships between true mixing populations and present-day references can be inferred by com- 
paring weighted LD curves using weights derived from a suite of reference populations. Finally, 
we describe several improvements to the computation and fitting of weighted LD curves: we show 
how to detect confounding LD from sources other than admixture, improving the robustness of our 
methods against such effects, and we present a novel fast Fourier transform-based algorithm for 
weighted LD computation that reduces typical run times from hours to seconds. We implement all 
of these advances in a software package, ALDER (Admixture-induced Linkage Disequilibrium for 
Evolutionary Relationships). 

We demonstrate the performance of ALDER by using it to test for admixture among all HGDP 
populations ( Li et al. 2008| ) and compare its results to those of the 3-population test, highlighting 



the sensitivity trade-offs of each approach. We further illustrate our methodology with case stud- 
ies of Central African Pygmies, Sardinians, and Japanese, revealing new details that add to our 
understanding of admixture events in the history of each population. 

METHODS 

Properties of weighted admixture LD: In this section we introduce a weighted LD statistic that 
uses the decay of LD to detect signals of admixture given SNP data from an admixed population 
and reference populations. This statistic is similar to, but has an important difference from, the 



weighted LD statistic used in ROLLOFF (MOORJANI et al. 2011 , PATTERSON et al. 2012). The 



formulation of our statistic is particularly important in allowing us to use the amplitude of weighted 
LD to make inferences about history. We begin by deriving quantitative mathematical properties 
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of this statistic that can be used to infer admixture parameters. 



Basic model and notation: We will primarily consider a point-admixture model in which a pop- 
ulation C descends from a mixture of populations A and B to form C, n generations ago, in 
proportions a + (3 = 1, followed by random mating. As we will discuss later, we can assume 
for our purposes that the genetic drift between C and C is negligible, and hence we will simply 
refer to population C when there is no risk of ambiguity. We are interested in the properties of 
the LD in population C induced by admixture. Consider two biallelic, neutrally evolving SNPs x 
and y and for each SNP call one allele '0' and the other '1'. Denote by pa(x), Pb{%), Pa{u), an d 
Pb(u) the frequencies of the '1' alleles at x and y in the mixing populations A and B (at the time 
of admixture), and let 5(x) := Pa(x) — Pb(x) and 5(y) := Pa(v) — Pb(v) be the allele frequency 
differences. 

Let d denote the genetic distance between x and y and assume that x and y were in linkage 
equilibrium in populations A and B. Then the LD in population C immediately after admixture is 

D = a/35(x)5(y), 

where D is the standard haploid measure of linkage disequilibrium as the covariance of alleles at x 



and y ( |Chakraborty and Weiss 1988[ ). After n generations of random mating, the LD decays 
to 

D n = e- nd D = e- nd a(35(x)5(y) 



assuming infinite population size ( CHAKRABORTY and WEISS 1988| ). For a finite population, the 



above formula holds in expectation with a small adjustment factor caused by post-admixture drift 



(Ohta and Kimura 1971 ) 



E[D n ] = e- nd e- n/2N *af35(x)5(y), 

where N e is the effective population size. In most applications the adjustment factor e " n / 2N <= is 
negligible, so we will omit it in what follows ( |MoORJANl et al. 2012[ Note SI). 



In practice, our data consist of unphased diploid genotypes, so we expand our notation accord- 
ingly. Consider sampling a random individual from population C. We use a pair of {0, 1} random 
variables X\ and X 2 to refer to the two alleles at x and define random variables Y\ and Y 2 likewise. 
Our unphased SNP data represent observations of the {0, 1, 2} random variables X := X\ + X 2 
and Y :=Y 1 + Y 2 . 

Define z(x, y) to be the covariance 

z(x, y) : = cov(X, Y) = cov(X 1 + X 2 , Y x + Y 2 ), (1) 
which can be decomposed into a sum of four haplotype covariances: 

z{x, y) = cov(X 1 , Fx) + cov(X 2 , Y 2 ) + cov(Xi, Y 2 ) + cov(X 2 , Y{). (2) 

The first two terms measure D for the separate chromosomes, while the third and fourth terms 
vanish, since they represent covariances between variables for different chromosomes, which are 
independent. Thus, the expected total diploid covariance is simply 

E[z(x,y)] = 2e- nd af36(^(y)- (3) 



Relating weighted LD to admixture parameters: MOORJANI et al. (201 1| ) first observed that pair 



wise LD measurements across a panel of SNPs can be combined to enable accurate inference of the 
age of admixture, n. The crux of their approach was to harness the fact that the ALD between two 
sites x and y scales as e~ nd multiplied by the product of allele frequency differences 8(x)S(y) in 
the mixing populations. While the allele frequency differences 5(-) are usually not directly com- 
putable, they can often be approximated. Thus, MOORJANI et al. (201 1[ ) formulated a statistic, 



ROLLOFF, that dates admixture by fitting an exponential decay e nd to correlation coefficients 



between LD measurements and surrogates for 5(x)S(y). (Note that MOORJANI et al. (201 1 1 define 
z(x, y) as a sample correlation coefficient, analogous to the classical LD measure r, as opposed to 
the sample covariance ([T]) we use here; we find the latter more mathematically convenient.) 

We build upon these previous results by deriving exact formulas for weighted sums of ALD 
under a variety of weighting schemes that serve as useful surrogates for 5(x)8(y) in practice. These 
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calculations will allow us to interpret the magnitude of weighted ALD to obtain additional leverage 
on admixture parameters. Additionally, the theoretical development will generally elucidate the 
behavior of weighted ALD and its applicability in various phylogenetic scenarios. 



Following MOORJANI et al. (201 1 1, we partition all pairs of SNPs (x, y) into bins of roughly 
constant genetic distance: 

S{d) := {(x,y) \\x-y\m d}. 

Given a choice of weights w(-), one per SNP, we define the weighted LD at distance d as 

£s(d) z(x,y)w(x)w{y) 

a{) := — m\ • 

Assume first that our weights are the true allele frequency differences in the mixing popula- 
tions, i.e., w(x) = 5(x) for all x. Applying ([3]), 



E[a{d)} = E 



\S{d)\ 



Y. s[d) ^E[^) 2 Kyf]e- nd 



15(d) I 

= 2a/3F 2 (A 1 B) 2 e' nd } (4) 
where F 2 (A, B) is the expected squared allele frequency difference for a neutral allele, as defined 



in 



Reich et al. (2009) and Patterson et al. (2012). Thus, a(d) has the form of an exponential 



decay as a function of d, with time constant n giving the date of admixture. 

In practice, we must compute an empirical estimator of a(d) from a finite number of sampled 
genotypes. Say we have a set of m diploid admixed samples from population C indexed by i = 
1, . . . , m, and denote their genotypes at sites x and y by Xj, y, L G {0, 1, 2}. Also assume we have 
some finite number of reference individuals from A and B with empirical mean allele frequences 
Pa(-) andp B (-). Then our ALDER software computes 

I2s(d) coy ( X > Y )(M x ) ~ PB(x))(p A (y) -p B {y)) 

a{d) ■= m\ ' (5) 



where 



j m 

cov(X, Y) = - x)(yi - y) 

m — 1 ^-^ 



i=i 
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is the usual unbiased sample covariance, so E[a(d)] = a(d) (assuming no background LD, so the 
ALD in population C is independent of the drift processes producing the weights). 

The weighted sum J2s{d) z ( x -> y) w ( x ) w (y) is a natural quantity to use for detecting ALD decay 
and is common to our weighted LD statistic a(d) and previous formulations of ROLLOFF. Indeed, 
for SNP pairs (x, y) at a fixed distance d, we can think of equation ([3]) as providing a simple lin- 
ear regression model between LD measurements z(x, y) and allele frequency divergence products 
8(x)5(y). In practice, the linear relation is made noisy by random sampling, as noted above, but 
the regression coefficient 2a(3e~ nd can be inferred by combining measurements from many SNP 
pairs (x, y); in fact, the weighted sum J2s(d) z(x,y)8( x )Hy) m tne numerator of formula (|5]) is 
precisely the numerator of the least-squares estimator of the regression coefficient, which is the 



formulation of ROLLOFF given in MOORJANI et al. (2012, Note SI). Note that measurements 



of z(x,y) cannot be combined directly without a weighting scheme, as the sign of the LD can 
be either positive or negative; additionally, the weights tend to preserve signal from ALD while 
depleting contributions from other forms of LD. 

Up to scaling, our ALDER formulation is roughly equivalent to the regression coefficient for- 



mulation of ROLLOFF (MOORJANI et al. 2012, Note SI). In contrast, the original ROLLOFF 



statistic (Patterson etal. 2012) computed a correlation coefficient between z(x, y) and w(x)w(y) 



over S(d). However, the normalization term yXls(d) z ( x -> v) 2 m me denominator of the correla- 
tion coefficient can exhibit an unwanted rf-dependence that biases the inferred admixture date if the 



admixed population has undergone a strong bottleneck (MOORJANI et al. 2012 Note SI) or in the 
case of recent admixture and large sample sizes. Beyond correcting the date bias, the d(d) curve 
that ALDER computes has the advantage of a simple form for its amplitude in terms of meaningful 
quantities, providing us additional leverage on admixture parameters. Additionally, we will show 
that d(d) can be efficiently computed via a new fast Fourier transform-based algorithm. 

Using weights derived from diverged reference populations: In the above development we set the 
weights w(x) to equal allele frequency differences 5(x) between the true mixing populations A and 
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B. In practice, in the absence of ancient DNA data, it is impossible to sample allele frequencies 
from the time of mixture (an event in the past); instead, we substitute reference populations A' and 
B' that are accessible, setting w(x) = 5'(x) :— Pa'(x) — Pb'(x). In a given data set, the closest 
surrogates A' and B' may be somewhat diverged from A and B, so it is important to understand 
the consequences for the weighted LD a(d). 

We show in Appendix 1 that with reference populations A' and B' in place of A and B, equa- 
tion Q for the expected weighted LD curve changes only slightly, becoming 

E[a(d)} = 2a(3F 2 (A", B") 2 e~ nd , (6) 



where A" and B" are the branch points of A' and B' on the A—B lineage (Figure SI ). Notably, 
the curve still has the form of an exponential decay with time constant n, the age of admixture, 
albeit with its amplitude (and therefore signal-to-noise) attenuated according to how far A" and 
B" are from the true ancestral mixing populations. Drift along the A' -A" and B'-B" branches 
likewise decreases signal-to-noise but in the reverse manner: higher drift on these branches makes 
the weighted LD curve noisier but does not change its expected amplitude. As above, given a real 
data set containing finite samples, we compute an estimator a(d) analogous to formula that has 
the same expectation as a(d). 

Using the admixed population as one reference: Weighted LD can also be computed with only 



a single reference population by using the admixed population as the other reference (PlCKRELL 



et al. 2012, Supplement Sec. 4). Assuming first that we know the allele frequencies of the ancestral 



mixing population A and the admixed population C, the formula for the expected curve becomes 

E[a{d)] = 2a(3 3 F 2 {A, B) 2 e~ nd . (7) 

Using C itself as one reference population and X' as the other reference (which could branch 
anywhere between A and B), the formula for the amplitude is slightly more complicated, but the 



curve retains the e nd decay (Figure S2 ): 



E[a{d)\ = 2a(3{aF 2 {A, X") - f3F 2 {B, X")) 2 e - nd . (8) 
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Derivations of these formulas are given in Appendix 1 . 

A subtle but important technical issue that arises when computing weighted LD with a single 
reference is that the true weighted LD statistic, 

a(d) = cav(X,Y)(ji x -p(x))(fiy-p(y)), 

where 

\h> = ap A (x) + (3p B (x) and fj, y = ap A (y) + (3p B (y) 

are the mean allele frequencies of the admixed population (ignoring drift) and p(-) denote allele 
frequencies of the reference population, cannot be estimated by the naive formula 

cov(X, Y){(l x - p{x)){jly - p(y)). 

that would appear to be a natural analog of ([5]). The difficulty is that the covariance term and the 
weights both involve the allele frequencies fi x and thus, while the standard estimators for each 
term are individually unbiased, their product is a biased estimate of the weighted LD. 



PlCKRELL et al. (2012) circumvents this problem by partitioning the admixed samples in two 
groups, designating one group for use as admixed representatives and the other as a reference 
population; this method eliminates bias but reduces statistical power. We instead compute a poly- 



ache statistic (File SI ) that provides an unbiased estimator a(d) of the weighted LD with maximal 
power. 

Affine term in weighted LD curve from assortative mating: Weighted LD curves computed on real 
populations often exhibit a nonzero horizontal asymptote contrary to the exact exponential decay 
formulas we have derived above. Such behavior can be caused by assortative mating in violation of 
our model. We show in Appendix 1 that if we instead model the admixed population as consisting 
of randomly mating subpopulations with heterogeneous amounts a — now a random variable — of 
mixed ancestry, our equations for the curves take the form 

E[a{d)} = Me~ nd + K (9) 
12 



for constants M and K. Conveniently, however, the sum M + K/2 satisfies the same equations 
that the coefficient of the exponential does in the homogeneous case: adjusting equation ([6]) for 
assortative mating gives 

M + K/2 = 2afiF 2 (A", B"f (10) 
for two-reference weighted LD, and in the one-reference case, modifying equation ([8]) gives 

M + K/2 = 2a(3(aF 2 (A, X") — /3F 2 (B, X")) 2 . (11) 

For brevity, from here on we will take the amplitude of an exponential-plus-affine curve to mean 
M + K/2. 

Admixture inference using weighted LD: We now describe how the theory we have developed 
can be used to investigate admixture. We detail novel techniques that use weighted LD to infer 
admixture parameters, test for admixture, and learn about phylogeny. 

Inferring admixture dates and fractions using one or two reference populations: As noted above, 
our ALDER formulation of weighted LD hones the original two-reference admixture dating tech- 
nique of ROLLOFF ( |MoORJANl et al. 20lT| ), correcting a possible bias ( |MOQRJANl et al. 2012 



Note SI), and the one-reference technique ( |PlCKRELL et al. 2012j ), improving statistical power. 



PlCKRELL et al. (2012) also observed that weighted LD can be used to estimate ancestral mixing 
fractions. We further develop this application now. 

The main idea is to treat our expressions for the amplitude of the weighted LD curve as equa- 
tions that can be solved for the ancestry fractions a and (3 = 1 — a. First consider two-reference 
weighted LD. Given samples from an admixed population C and reference populations A' and B', 
we compute the curve a(d) and fit it as an exponential decay plus affine term: a(d) ~ Me~ nd + K. 
Let a := M + K/2 denote the amplitude of the curve. Then equation ([10]) gives us a quadratic 



equation that we can solve to obtain an estimate a of the mixture fraction a, 

2d(l - a) F 2 (A", B") 2 = d , 
13 



assuming we can estimate F 2 (A", B") 2 . Typically the branch point populations A" and B" are 
unavailable, but their F 2 distance can be computed by means of an admixture tree ( |PATTERSON 



etal. 2012[ |PlCKRELL and PRITCHARD 2012] |LlPSON et al. 2012| ). A caveat of this approach is 
that a and 1 — a produce the same amplitude and cannot be distinguished by this method alone; 
additionally, the inversion problem is ill-conditioned near a = 0.5. 

The situation is more complicated when using the admixed population as one reference. First, 



the amplitude relation from equation (11) gives a quartic equation in d: 



2d(l - a)[aF 2 (A, X") - (1 - &)F 2 (B, X")} 2 



a . 



Second, the F 2 distances involved are in general not possible to calculate from an admixture 



tree (Patterson et al. 2012, Lipson etal. 2012). In the special case that one of the true mixing 



populations is available as a reference, however — i.e., X' = A — PlCKRELL et al. (2012) demon- 
strated that mixture fractions can be estimated much more easily. From equation ([7]), the expected 
amplitude of the curve is 2af3 3 F 2 (A, B) 2 . On the other hand, assuming no drift in C since the 
admixture, a quick calculation shows that 

F 2 (A,C)=(3 2 F 2 (A,B), 

and F 2 (A, C) is estimable directly from the sample data. Combining these relations, we can obtain 
our estimate d by solving the equation 



2d/(l-d) = ao/F 2 (A,Cf 



(12) 



In practice, the true mixing population A is not available for sampling, but a closely-related 



population A' may be. In this case, the value of d given by equation ( [12] ) with A' in place of 
A is a lower bound on the true mixture fraction a (Appendix 1). This bounding technique is 
the most compelling of the above mixture fraction inference approaches, as prior methods cannot 
perform such inference with only one reference population. In contrast, when more references 



are available, admixture tree-fitting methods readily estimate mixture fractions (PATTERSON etal. 
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20121 IPlCKRELL and PRITCHARD 20121 ILlPSON etal. 2012b. In such cases we believe that these 



/-statistic -based methods are more robust than LD-based inference, which suffers from the de- 
generacy of solutions noted above; however, the weighted LD approach can provide confirmation 
based on a different genetic mechanism. 

Testing for admixture: Thus far, we have taken it as given that the population C of interest is ad- 
mixed and developed methods for inferring admixture parameters by fitting weighted LD curves. 
Now we consider the question of whether weighted LD can be used to determine whether admix- 
ture occurred in the first place. We develop a weighted LD-based formal test for admixture that 



is broadly analogous to the drift-based 3-population test ( |REICH et al. 2009} |PATTERSON et al. 
2012| ) but sensitive in different scenarios. 



A complication of interpreting weighted LD is that certain demographic events other than ad- 
mixture can also produce positive weighted LD that decays with genetic distance, particularly in 
the one-reference case. Specifically, if population C has experienced a recent bottleneck or an 
extended period of low population size, it may contain long-range LD. Furthermore, this LD typi- 
cally has some correlation with allele frequencies in C; consequently, using C as one reference in 
the weighting scheme produces a spurious weighted LD signal. 

In the two-reference case, LD from reduced population size in C is generally washed out by 
the weighting scheme assuming the reference populations A' and B' are not too genetically similar 
to C. The reason is that the weights <$(•) = Pa'(-) — Pb'(') arise from drift between A' and B' that 
is independent of demographic events producing LD in C (beyond genetic distances that are so 
short that the populations share haplotypes descended without recombination from their common 
ancestral haplotype). Thus, observing a two-reference weighted LD decay curve is generally good 
evidence that population C is admixed. There is still a caveat, however. If C and one of the 



references, say A', share a recent population bottleneck (Figure S3 ), then the bottleneck-induced 
LD in C can be correlated to the allele frequencies of A', resulting once again in spurious weighted 
LD: in fact, the one-reference case we discussed above is the limiting case A' = C of this situation. 
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With these considerations in mind, we propose an LD-based three-population test for admixture 
that includes a series of pre-tests safeguarding against the pathological demographies that can 
produce a non-admixture weighted LD signal. We outline the test now; details are in Appendix 2. 
Given a population C to test for admixture and references A' and B', the main test is whether the 
observed weighted LD a(d) using A'-B' weights is positive and well-fit by an exponential decay 
curve. We estimate a jackknife-based p-value by leaving out each chromosome in turn and re- 
fitting the weighted LD as an exponential decay; the jackknife then gives us a standard error on 
the fitting parameters — namely, the amplitude and the decay constant — that we use to measure the 
significance of the observed curve. 

The above procedure allows us to determine whether there is sufficient signal in the weighted 
LD curve to reject the null hypothesis (under which a(d) is random "colored" noise in the sense 
that it contains autocorrelation). However, in order to conclude that the curve is the result of 
admixture, we must rule out the possibility that it is being produced by demography unrelated to 
admixture. We therefore apply the following pre-test procedure. First, we determine the distance 
to which LD in C is significantly correlated with LD in either A' or B'\ to minimize signal from 
shared demography, we subsequently ignore data from SNP pairs at distances smaller than this 
correlation threshold. Then, we compute one-reference weighted LD curves for population C with 
A'-C and B'-C weights and check that the curves are well-fit as exponential decays. In the case 
that C is actually admixed between populations related to A' and B', the one-reference A'-C and 
B'-C curves pick up the same e~ nd admixture LD decay signal. If C is not admixed but has 
experienced a shared bottleneck with A' (producing false-positive admixture signals from the A'- 
B' and B'-C curves), however, the A'-C weighting scheme is unlikely to produce a weighted LD 
curve, especially when fitting beyond the LD correlation threshold. 

This test procedure is intended to be conservative, so that a population C identified as admixed 
can strongly be assumed to be so, whereas if C is not identified as admixed, we are less confident 
in claiming that C has experienced no admixture whatsoever. In situations where distinguishing 
admixture from other demography is particularly difficult, the test will err on the side of caution; for 
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example, even if C is admixed, the test may fail to identify C as admixed if it has also experienced 
a bottleneck. Also, if a reference A' shares some of the same admixture history as C or is simply 
very closely related to C, the pre-test will typically identify long-range correlated LD and deem A' 
an unsuitable reference to use for testing admixture. 

Learning about phylogeny: Given a triple of populations (C; A', B'), our test can identify admix- 
ture in the test population C, but what does this imply about the relationship of populations A' 
and B' to C? As with the drift-based 3-population test, test results must be interpreted carefully: 
even if C is admixed, this does not necessarily mean that the reference populations A' and B' are 
closely related to the true mixing populations. However, computing weighted LD curves with a 
suite of different references can elucidate the phylogeny of the populations involved: our ampli- 



tude formulas < |T0| ) and ( fTTj ) provide information about the locations on the phylogeny at which the 
references diverge from the true mixing populations. 

More precisely, in the notation of Figure |ST] the amplitude of the two-reference weighted LD 
curve is 2a(3F 2 (A", B") 2 , which is maximized when A" = A and B" = B and is minimized when 
A" = B" . So, for example, we can fix A' and compute curves for a variety of references B'\ the 
larger the resulting amplitude, the closer the branch point B" is to B. In the one-reference case, as 
the reference X' is varied, the amplitude 2a(3(aF 2 (A, X") — (3F 2 (B, X")) 2 traces out a parabola 
that starts at 2a(3 3 F 2 (A, B) 2 when X" = A, decreases to a minimum value of 0, and increases 



again to 2a 3 (3F 2 (A, B) 2 when X" = B (Figure S2). Here, the procedure is more qualitative 
because the branches F 2 (A, X") and F 2 (B, X") are less directly useful and the mixture proportions 
a and /3 may not be known. 

Implementation of ALDER: We now describe some more technical details of the software pack- 
age we implemented, ALDER. 



Fast Fourier transform algorithm for computing weighted LD: We developed a novel algorithm 
that algebraically manipulates the weighted LD statistic into a form that can be computed using a 
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fast Fourier transform (FFT), dramatically speeding up the computation (File[S2]). The algebraic 



transformation is made possible by the simple form ([5]) of our weighted LD statistic along with a 



genetic distance discretization procedure that is similar in spirit to ROLLOFF (MOORJANI et al 



2011 1 but subtly different; instead of binning the contributions of SNP pairs (x, y) by discretizing 
the genetic distance \x — y\ = d, we discretize the genetic map positions x and y themselves 
(using a default resolution of 0.05 cM). For two-reference weighted LD, the resulting FFT-based 
algorithm that we implemented in ALDER has computational cost that is approximately linear in 
the data size; in practice, it ran three orders of magnitude faster than ROLLOFF on typical data 
sets we analyzed. 

Curve-fitting: We fit discretized weighted LD curves a(d) as Me~ nd + K from equation (|9]), us- 
ing least-squares to find best-fit parameters. This procedure is similar to ROLLOFF, but ALDER 
makes two important technical advances that significantly improve the robustness of the fitting. 
First, ALDER directly estimates the affine term K that arises from assortative mating by using 
inter-chromosome SNP pairs that are effectively at infinite genetic distance (Appendix 1). The al- 
gorithmic advances we implement in ALDER enable efficient computation of the average weighted 
LD over all pairs of SNPs on different chromosomes, giving K and, importantly, eliminating one 
parameter from the exponential fitting. In practice, we have observed that ROLLOFF fits are some- 
times sensitive to the maximum inter-SNP distance d to which the weighted LD curve is computed 
and fit; ALDER eliminates this sensitivity. 

Second, because background LD is present in real populations at short genetic distances and 
confounds the ALD signal (interfering with parameter estimates or producing spurious signal en- 
tirely), it is important to fit weighted LD curves starting only at a distance beyond which back- 
ground LD is negligible. ROLLOFF used a fixed threshold of d > 0.5 cM, but some populations 
have longer-range background LD (e.g., from bottlenecks), and moreover, if a reference population 
is closely related to the test population, it can produce a spurious weighted LD signal due to recent 
shared demography. ALDER therefore estimates the extent to which the test population shares 
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correlated LD with the reference(s) and only fits the weighted LD curve beyond this minimum 
distance as in our test for admixture (Appendix 2). 

We estimate standard errors on parameter estimates by performing a jackknife over the auto- 
somes used in the analysis, leaving out each in turn. Note that the weighted LD measurements 
from individual pairs of SNPs that go into the computed curve a(d) are not independent of each 
other; however, the contributions of different chromosomes can reasonably be assumed to be inde- 
pendent. 

Data sets: We primarily applied our weighted LD techniques to a data set of 940 individuals in 



53 populations from the CEPH-Human Genome Diversity Cell Line Panel (HGDP) (ROSENBERG 



et al. 2002| ) genotyped on an Illumina 650K SNP array ( |Ll etal. 2008] ). To study the effect of SNP 



ascertainment, we also analyzed the same HGDP populations genotyped on the Affymetrix Human 



Origins Array (PATTERSON etal. 2012 ). For some analyses we also included HapMap Phase 3 data 



(The International HapMap Consortium 2010) merged either with the Illumina HGDP 



data set, leaving approximately 600K SNPs, or with the Indian data set of REICH et al. (2009) 
including 16 Andaman Islanders (9 Onge and 7 Great Andamanese), leaving 500K SNPs. 

We also constructed simulated admixed chromosomes from 112 CEU and 113 YRI phased 
HapMap individuals using the following procedure, described in MOORJANI et al. (201 \\ . Given 



desired ancestry proportions a and (3, the age n of the point admixture, and the number m of 
admixed individuals to simulate, we built each admixed chromosome as a composite of chromo- 
somal segments from the source populations, choosing breakpoints via a Poisson process with 
rate constant n, and sampling blocks at random according to the specified mixture fractions. We 
stipulated that no individual haplotype could be reused at a given locus among the m simulated 
individuals, preventing unnaturally long identical-by-descent segments but effectively eliminating 
post-admixture genetic drift. For the short time scales we study (admixture occurring 200 or fewer 
generations ago), this approximation has little impact. We used this method in order to maintain 
some of the complications inherent in real data. 
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RESULTS 

Simulations: First, we demonstrate the accuracy of several forms of inference from ALDER on 
simulated data. We generated simulated genomes for mixture fractions of 75% YRI / 25% CEU 
and 90% YRI / 10% CEU and admixture dates of 10, 20, 50, 100, and 200 generations ago. For 
each mixture scenario we simulated 40 admixed individuals according to the procedure above. 

We first investigated the admixture dates estimated by ALDER using a variety of reference 
populations drawn from the HGDP with varying levels of divergence from the true mixing popu- 
lations. On the African side, we used HGDP Yoruba (essentially identical to HapMap YRI) and 
San; on the European side, we used French (very close to CEU), Han, and Papuan. We computed 
two-reference weighted LD curves using pairs of references, one from each group, as well as one- 
reference curves using the simulated population as one reference and each of the above HGDP 
populations as the other. 



For the 75% YRI mixture, estimated dates are nearly all accurate to within 10% (Table [ST]). 
The noise levels of the fitted dates (estimated by ALDER using the jackknife) are the lowest for the 
Yoruba-French curve, as expected, followed by the one-reference curve with French, consistent 
with the admixed population being mostly Yoruba. The situation is similar but noisier for the 
90% YRI mixture (Table |S2|); in this case, the one-reference signal is quite weak with Yoruba and 
undetectable with San as the reference, due to the scaling of the amplitude (equation < fTT| )) with the 
cube of the CEU mixture fraction. 

We also compared fitted amplitudes of the weighted LD curves for the same scenarios to those 
predicted by formulas ( |T0| ) and ( fTTj ); the accuracy trends are similar (Tables S3} S4). Finally, we 



tested formula ( [12] ) for bounding mixture proportions using one-reference weighted LD ampli- 
tudes. We computed lower bounds on the European ancestry fraction using French, Russian, Sar- 
dinian, and Kalash as successively more diverged references. As expected, the bounds are tight for 
the French reference and grow successively weaker (Tables |S5j|S6]). We also tried lower-bounding 
the African ancestry using one-reference curves with an African reference. In general, we expect 
lower bounds computed for the major ancestry proportion to be much weaker (Appendix 1), and 
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indeed we find this to be the case, with the only slightly diverged Mandenka population producing 
extremely weak bounds. An added complication is that the Mandenka are an admixed population 



with a small amount of West Eurasian ancestry (Price et al. 2009 1, which is not accounted for in 
the amplitude formulas we use here. 

Another notable feature of ALDER is that, to a much greater extent than /-statistic methods, 
its inference quality improves with more samples from the admixed test population. As a demon- 
stration of this, we simulated a larger set of 100 admixed individuals as above, for both 75% YRI / 
25% CEU and 90% YRI / 10% CEU scenarios, and compared the date estimates obtained on sub- 



sets of 5-100 of these individuals with two different reference pairs (Tables S7 S8). With larger 
sample sizes, the estimates become almost uniformly more accurate, with smaller standard errors. 
By contrast, we observed in the course of running experiments that increasing the sample sizes for 
reference populations has only minimal effects on the performance of ALDER (data not shown). 
This is similar to the phenomenon that the precision of /-statistics does not improve appreciably 
with more than a moderate number of samples and is due to the inherent variability in genetic drift 
among different loci. 

Robustness: A challenge of weighted LD analysis is that owing to various kinds of model vio- 
lation, the parameters of the exponential fit of an observed curve a(d) may depend on the starting 
distance do from which the curve is fit. We therefore explored the robustness of the fitting param- 
eters to the choice of d in a few scenarios (Figure [T]). First, in a simulated 75% / 25% YRI-CEU 
admixture 50 generations ago, we find that the decay constant and amplitude are both highly ro- 
bust to varying d from 0.5 to 2.0 cM (Figure [TJ top). This result is not surprising because our 
simulated example represents a true point admixture with minimal background LD in the admixed 
population. 

In practice, we expect some dependence on do due to background LD or longer-term admixture 
(either continuously over a stretch of time or in multiple waves). Both of these will tend to increase 
the weighted LD for smaller values of d relative to an exact exponential curve, so that estimates of 
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the decay constant and amplitude will decrease as we increase the fitting start point d ; the extent 
to which this effect occurs will depend on the extent of the model violation. We studied the do- 
dependence for two example admixed populations, HGDP Uygur and HapMap Maasai (MKK). 
For Uygur, the estimated decay constants and amplitudes are fairly robust to the start point of 
the fitting, varying roughly by ±10% (Figure [TJ middle). In contrast, the estimates for Maasai 
vary dramatically, decreasing by more than a factor of 2 as d is increased from 0.5 to 2.0 cM 
(Figure [T] bottom). This behavior is likely due to multiple-wave admixture in the genetic history 
of the Maasai; indeed, it is visually evident that the weighted LD curve for Maasai deviates from 
an exponential fit (Figure[T|) and is in fact better-fit as a sum of exponentials. 

It is also important to consider the possibility of SNP ascertainment bias, as in any study based 
on allele frequencies. We believe that for weighted LD, ascertainment could have modest effects 



on the amplitude, which depends on F 2 distances ( |PATTERSON et al. 2012} |LlPSON et al. 2012] ), 
but will not affect the estimated date. Running ALDER on a suite of admixed populations in the 
HGDP under a variety of ascertainment schemes suggests that admixture date estimates are indeed 



quite stable to ascertainment (Table S9). 



Admixture test results on HGDP populations: To compare the sensitivity of our LD-based test 
for admixture to the 3-population test, we ran both ALDER and the 3-population test on all triples 
of populations in the HGDP. Interestingly, while the tests concur on the majority of the populations 
they identify as admixed, each also identifies several populations as admixed that the other does 
not (Table[T|), showing that the tests have differing sensitivity to different admixture scenarios. 

Admixture identified by only ALDER: The 3-population test loses sensitivity primarily as a result 
of drift since splitting from the references' lineages. More precisely, using the notation of Fig- 
ure 



SI the 3-population test statistic fz(C] A\ B') estimates the sum of two directly competing 
terms: — 2a/3F 2 (A, B), the negative quantity arising from admixture that we wish to detect, and 
F 2 (A", A) + F 2 (B", B) + 2F 2 (C, C), a positive quantity from the "off-tree" drift branches. If the 
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latter term dominates, the 3-population test will fail to detect admixture regardless of the statistical 
power available. For example, Melanesians are only found to be admixed according to the ALDER 
test; the inability of the 3-population test to identify them as admixed is almost certainly due to 
the long off-tree drift from the Papuan branch prior to admixture. The situation is similar for the 
Pygmies. 

Small mixture fractions also diminish the size of the admixture term —2a(3F 2 (A, B) relative 
to the off-tree drift, and we believe this effect along with post-admixture drift may be the reason 
Sardinians are detected as admixed only by ALDER. In the case of the San, who have a small 



amount of Bantu admixture ( [PlCKRELL et al. 2012[ ) the small mixture fraction may again play a 



role along with the lack of a reference population closely related to the pre-admixture San, meaning 
that using existing populations incurs long off-tree drift. 

Admixture identified by only the 3-population test: There are also multiple reasons why the 3- 
population test can identify admixture that ALDER does not. For the HGDP European populations 
in this category (Table [T}, the 3-population test is picking up a signal of admixture identified by 



Patterson et al. (2012) and interpreted there as a large-scale admixture event in Europe involv- 
ing Neolithic farmers closely related to present-day Sardinians and an ancient northern Eurasian 
population. This mixture likely began quite anciently (e.g. 7,000-9,000 years ago when agriculture 



arrived Europe ( |BRAMANTI et al. 2009} |SOARES et al. 2010[ |PlNHASl et al. 2012| )), and because 



admixture LD breaks down as e~ nd , where n is the age of admixture, there is nearly no LD left 
for ALDER to harness beyond the correlation threshold d . An additional factor that may inhibit 
LD-based testing is that in order to prevent false-positive identifications of admixture, ALDER typ- 
ically eliminates reference populations that share LD (and in particular, admixture history) with 
the test population, whereas the 3-population test can use such references. 

To summarize, the ALDER and 3-population tests both analyze a test population for admixture 
using two references, but they detect signal based on different "genetic clocks." The 3-population 
test uses signal from genetic drift, which can detect quite old admixture but must overcome a 
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counteracting contribution from post-admixture and off-tree drift. The LD-based test uses recom- 
bination, which is relatively unaffected by small population size-induced long drift and has no 
directly competing effect, but has limited power to detect chronologically old admixtures because 
of the rapid decay of the LD curve. Additionally, as discussed above in the context of simulation 
results, the LD-based test may be better suited for large data sets, since its power is enhanced more 
by the availability of many samples. The tests are thus complementary and both valuable. 

Case studies: We now present detailed results for several human populations, all of which ALDER 
identifies as admixed but are not found by the 3-population test (Table [T]). We infer dates of 
admixture and in some cases additional historical insights. 

Pygmies: Both Central African Pygmy populations in the HGDP, the Mbuti and Biaka, show 
evidence of admixture (Table [T]), about 28 ± 4 generations (800 years) ago for Mbuti and 38 ± 4 
generations (1100 years) ago for Biaka, estimated using San and Yoruba as reference populations 
(Figure [2}\,C). The intra-population heterogeneity is low, as demonstrated by the negligible affine 
terms. In each case, we also generated weighted LD curves with the Pygmy population itself as 
one reference and a variety of second references. We found that using populations French, Han, or 
Yoruba as the second reference gave very similar amplitudes, but the amplitude was significantly 
smaller with the other Pygmy population or San as the second reference (Figure |2jB,D). Using the 
amplitudes with Yoruba, we estimated mixture fractions of at least 15.9 ± 0.9% and 28.8 ± 1.4% 
Yoruba-related ancestry for Mbuti and Biaka, respectively. 

The phylogenetic interpretation of the relative amplitudes is complicated by the fact that the 
Pygmy populations, used as references, are themselves admixed, but a plausible coherent expla- 
nation is as follows (see Figure |2^). We surmise that a proportion (3 (lower bounds given above) 
of Bantu-related gene flow reached the native Pygmy populations on the order of 1000 years ago. 
The common ancestors of Yoruba and non- Africans with the Bantu population are genetically not 
very different from Bantu, due to high historical population sizes (branching at positions X\ and 
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X 2 in Figure[2ji). Thus, the weighted LD amplitudes using Yoruba or non-Africans as second ref- 
erences are nearly 2a 3 (3F 2 (A, B) 2 , where B denotes the admixing Bantu population. Meanwhile, 
San and Western (resp. Eastern) Pygmies split from the Bantu-Mbuti (resp. Biaka) branch toward 



the middle or the opposite side from Bantu (X 3 and X 4 ), giving a smaller amplitude (Figure S2) 



Our results are in agreement with previous studies that have found evidence of gene flow from 



agriculturalists to Pygmies (QuiNTANA-MURCl et al. 2008, VERDU et al. 2009 PATIN et al 



2009| |Jarvis etal. 20121 ). |Quintana-Murci et al. (2008j ) suggested based on mtDNA evidence 



in Mbuti that gene flow ceased several thousand years ago, but more recently, JARVIS et al. (2012] > 



found evidence of admixture in Western Pygmies, with a local-ancestry-inferred block length dis- 
tribution of 3.1 ± 4.6 Mb (mean and standard deviation), consistent with our estimated dates. 

Sardinians: We detect a very small proportion of Sub-Saharan African ancestry in Sardinians, 
which our ALDER tests identified as admixed (Table [TJ Figure [3}\). To investigate further, we com- 
puted weighted LD curves with Sardinian as a test population and all pairs of the HapMap CEU, 
YRI and CHB populations as references (Table [2]). We observed an abnormally large amount of 
shared long-range LD in chromosome 8, likely do to an extended inversion segregating in Euro- 



peans (Price et al. 2008), so we omitted it from these analyses. The CEU- YRI curve has the 
largest amplitude, suggesting both that the LD present is due to admixture and that the small non- 
European ancestry component, for which we estimated a lower bound of 0.6±0.2%, is from Africa. 
The existence of a weighted LD decay curve with CHB and YRI as references provides further ev- 
idence that the LD is not simply due to a population bottleneck or other non-admixture sources, as 
does the fact that our estimated dates from all three reference pairs are roughly consistent at about 
40 generations (1200 years). Our findings thus confirm the signal of African ancestry in Sardinians 



reported in MOORJANI et al. (201 1[ ). The date, small mixture proportion, and geography are con- 



sistent with a small influx of migrants from North Africa, who themselves traced only a fraction of 



their ancestry ultimately to Sub-Saharan Africa, consistent with the findings of |DUPANLOUP et al. 
(20041). 
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Japanese: Genetic studies have suggested that present-day Japanese are descended from admix- 



ture between two waves of settlers, responsible for the Jomon and Yayoi cultures (HAMMER and 



Horai 1995||Hammer et al. 20061 |Rasteiro and Chikhi 2009] ) . We also observed evidence of 



admixture in Japanese (Table [T]), and while our ability to learn about the history is limited by the 
absence of a close surrogate for the original Paleolithic mixing population, we were able to take 
advantage of the one-reference inference capabilities of ALDER. We observed a clear weighted 
LD curve using HapMap JPT as the test population and JPT-CHB weights (Figure[3j3). This curve 
yields an estimate of 45 ± 6 generations, or about 1,300 years, as the age of admixture. To our 
knowledge, this is the first time genome-wide data have been used to date admixture in Japanese. 



As with previous estimates based on coalescence of Y-chromosome haplotypes (HAMMER et al. 



2006), our date is consistent with the archaeologically attested arrival of the Yayoi in Japan roughly 
2300 years ago (we suspect that our estimate is from later than the initial arrival because admix- 
ture may not have happened immediately). Based on the amplitude of the curve, we also obtain a 
(likely very conservative) genome- wide lower bound of 41 ± 3% "Yayoi" ancestry using formula 
( fT2] ) (under the reasonable assumption that Han Chinese are fairly similar to the Yayoi population). 
It is important to note that observation of a single -reference weighted LD curve is not sufficient 
evidence to prove that a population is admixed, but we did find a pair of references with which 
the ALDER test identified Japanese as admixed, which, combined with previous work and the lack 
of any signal of reduced population size, makes us confident that our inferences are based on true 
historical admixture. 

Onge: Lastly, we provide a cautionary example of weighted LD decay curves arising from demog- 
raphy and not admixture. We observed distinct weighted LD curves when analyzing the Onge, an 
indigenous population of the Andaman Islands. However, this curve is only present when using 
Onge themselves as one reference; moreover, the amplitude is independent of whether CEU, CHB, 
YRI, GIH (HapMap Gujarati), or Great Andamanese is used as the second reference (Figure [4]), 
as expected if the weighted LD is due to correlation between LD and allele frequencies in the test 



26 



population alone (and independent of the reference allele frequencies). Correspondingly, ALDER's 
LD-based test does not identify Onge as admixed using any pair of these references. Thus, while 
we cannot definitively rule out admixture, the evidence points toward internal demography (low 
population size) as the cause of the elevated LD, consistent with the current census of fewer than 
100 Onge individuals. 

DISCUSSION 

Strengths of weighted LD for admixture inference: The statistics underlying weighted LD are 
quite simple, making the formula for the expectation of a(d), as well as the noise and other errors 
from our inference procedure, relatively easy to understand. By contrast, local ancestry-based 
admixture dating methods (e.g., |POQL and NIELSEN (2009[ ) and |GRAVEL (2012| )) are sensitive 



to imperfect ancestry inference, and it is difficult to trace the error propagation to understand the 



ultimate effect on inferred admixture parameters. Similarly, the wavelet method of PUGACH et al 



(2011 ) uses reference populations to perform (fuzzy) ancestry assignment in windows, for which 
error analysis is challenging. 

Another strength of our weighted LD methodology is that it has relatively low requirements 
on the quality and quantity of reference populations. Our theory tells us exactly how the statistic 
behaves for any reference populations, no matter how diverged they are from the true ancestral 
mixing populations. In contrast, the accuracy of results from clustering and local ancestry methods 
is dependent on the quality of the reference populations used in ways that are difficult to character- 
ize. On the quantity side, previous approaches to admixture inference require a surrogate for each 
ancestral population, whereas as long as one is confident that the signal is truly from admixture, 
weighted LD can be used with only one available reference to infer times of admixture (as in our 



analysis of the Japanese) and bound mixing fractions (as in our Pygmy case study and PlCKRELL 
et al. (2012| )), problems that were previously intractable. 



Weighted LD also advances our ability to test for admixture. As discussed above, ALDER 
offers complementary sensitivity to the 3-population test and allows the identification of additional 
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populations as admixed. Another formal test for admixture is the 4-population test ( [REICH et al. 



2009, Patterson et al. 2012), which is quite sensitive but also has trade-offs; for example, it 
requires three distinctly branching references, whereas ALDER and the 3-population test only need 
two. Additionally, the phylogeny of the populations involved must be well understood in order 
to interpret a signal of admixture from the 4-population test properly (i.e., to determine which 
population is admixed). Using weighted LD, on the other hand, largely eliminates the problem of 
determining the destination or direction of gene flow, since the LD signal of admixture is intrinsic 
to a specified test population. 

One-reference versus two-reference curves: In practice, it is often useful to compute weighted 
LD curves using both the one-reference and two-reference techniques, as both can be used for 
inferences in different situations. Generally, we consider two-reference curves to be more reliable 
for parameter estimation, since using the test population as one reference is more prone to intro- 
duce unwanted signals, such as recent admixture from a different source, non-admixture LD from 
reduced population size, or undesired population structure among samples. In particular, popula- 
tions with more complicated histories and additional sources of LD beyond the specifications of our 
model will often have different estimates of admixture dates with one- and two-reference curves. 
There is a slight danger that date disagreement can reflect a false-positive admixture signal, but 
this is very unlikely if both one- and two-reference curves exist beyond the correlated LD thresh- 
old (see Appendix 2). Two-reference curves also allow for direct estimation of mixture fractions, 
although, as discussed above, we prefer instead to use the method of single-reference bounding. 
Moreover, there are a number of practical considerations that make the one-reference capabilities 
of ALDER desirable. Foremost is the possibility that one may not have a good surrogate available 
for one of the ancestral mixing populations, as in our Japanese example. Also, while our method 
of learning about phylogenetic relationships is best suited to two-reference curves because of the 
simpler form of the amplitude in terms of branch lengths, it is often useful to begin by comput- 
ing a suite of single-reference curves, both because the data generated will scale linearly with the 
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number of references available and because observing a range of different amplitudes gives an 
immediate signal of the presence of admixture in the test population. 

Overall, then, a sample sequence for applying ALDER to a new data set might be as follows: (1) 
test all populations for admixture using all pairs of references from among the other populations; 

(2) explore admixed populations of interest by comparing single -reference weighted LD curves; 

(3) learn more detail by analyzing selected two-reference curves alongside the one-reference ones; 

(4) estimate parameters using one- or two-reference curves as applicable. Of course, step (1) itself 
involves the complementary usefulness of both one- and two-reference weighted LD, since our test 
for admixture requires the presence of exponential decay signals in both types of curves. 

Effect of multiple-wave or continuous admixture: As discussed in our section on robustness 
of results, in the course of our data analysis, we observed that the weighted LD date estimate al- 
most always becomes more recent when the exponential decay curve is fit for a higher starting 
distance do. Most likely, this is because admixtures in human populations have taken place over 
multiple generations, such that our estimated times represent intermediate dates during the process. 
To whatever extent an admixture event is more complicated than posited in our point-admixture 
model, removing low-rf bins will lead the fitting to capture proportionately more of the more re- 
cent admixture. By default, ALDER sets d to be the smallest distance such that non-admixture 
LD signals can be confidently discounted for d > d (see Methods (Testing for admixture) and 
Appendix 2), but it should be noted that the selected d will vary for different sets of populations, 
and in each case the true admixture signal at d < d will also be excluded. Theoretically, this 
pattern could allow us to learn more about the true admixture history of a population, since the 
value of a(d) at each d represents a particular function of the amount of admixture that took place 
at each generation in the past. However, in our experience, fitting becomes difficult for any model 
involving more than two or three parameters. Thus, we made the decision to restrict ourselves to 
assuming a single point admixture, fit for a principled threshold d > d , accepting that the inferred 
date n represents some form of average value over the true history. 
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Other possible complications: In our derivations, we have assumed implicitly that the mixing 
populations and the reference populations are related through a simple tree. However, it may be 
that their history is more complicated, for example involving additional admixtures. In this case, 
our formulas for the amplitude of the ALD curve will be inaccurate if, for example, A and A' have 
different admixture histories. However, if our assumptions are violated only by events occurring 
before the divergences between the mixing populations and the corresponding references, then the 
amplitude will be unaffected. Moreover, no matter what the population history, as long as A and 
B are free of measurable LD (so that our assumption of independence of alleles conditional on a 
single ancestry is valid), there will be no effect on the estimated date of admixture. 

Conclusions and future directions: In this study, we have shown how linkage disequilibrium 
(LD) generated by population admixture can be a powerful tool for learning about history, extend- 



ing previous work that showed how it can be used for estimating dates of mixture (MOORJANI 



et al. 2011} |PATTERSON et al. 2012[ ). We have developed a new suite of tools, implemented in 
the ALDER software package, that substantially increases the speed of admixture LD analysis, 
improves the robustness of admixture date inference, and exploits the amplitude of LD as a novel 
source of information about history. In particular, (a) we show how admixture LD can be lever- 
aged into a formal test for mixture that can sometimes find evidence of mixture not detectable by 
other methods, (b) we show how to estimate mixture proportions, and (c) we show that we can 
even use this information to infer phylogenetic relationships. A limitation of ALDER at present, 
however, is that it is designed for a model of pulse admixture between two ancestral populations. 
An important direction for future work will be to generalize similar ideas to make inferences about 



the time course of admixture in the case that it took place over a longer period of time (POOL 



and NIELSEN 2009| |GRAVEL 2012| ) and to study multi-way admixture. In addition, it would be 
valuable to be able to use the information from admixture LD to constrain models of history for 
multiple populations simultaneously, either by extending ALDER itself or by using LD-based test 



results in conjunction with methods for fitting phylogenies incorporating admixture (PATTERSON 
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etal. 20121 IPickrell and Pritchard 20121 ILipson etal. 2012b 



SOFTWARE 

Executable and C++ source files for our ALDER software package are available online at the 



Berger and Reich Lab websites: |http : / /groups . csail .mit . edu/cb/ alder/ http : 



/ / genetics .med . harvard . edu/reich/Reich_Lab/ Software . html 
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APPENDIX 1: DERIVATIONS OF WEIGHTED LD FORMULAS 
Expected weighted LD using two diverged reference populations: We now derive equation ((6j) 
for the expected weighted LD using references A' and B' in place of A and B, retaining the notation 
of Figure [ST] Let A' and B' have allele frequencies p A >(-) and p B t (•) , and let 5' ( • ) : = p A > ( • ) — p B > ( • ) 
denote the allele frequency divergences with which we weight the LD z(x,y), giving the two-site 
statistic 

a(d) := z(x,y)5'(x)5'(y). 

(For brevity, we drop the binning procedure of averaging over SNP pairs (x, y) at distance \x—y\ ~ 
d here.) The value of the random variable z(x, y) is affected by sampling noise as well as genetic 
drift between A and B, while the random variables S'(x) and S'(y) are outcomes of genetic drift 
between A' and B' . These random variables are uncorrected conditional on the allele frequencies 
of x and y in A" and B". We also assume that x and y are distant enough to have negligible 
background LD and hence the drifts at the two sites are independent. We then have 

E[a(d)] = E[z(x,y)5'(x)5 / (y)] 

= E[E[z(x,y)5'(x)5'(y) | p A „(x),p B >'(x), pa"(v),Pb" (y)]] 

= E[2af35{x)5(y)5'(x)5'{y)e- nd } 

= 2a(3e- nd E 2 {A",B") 2 , 

where in the last step the relation E[8(x)5'(x)] = E[8(y)S'(y)] = F 2 (A",B") follows from the 
fact that the intersection of the drift paths S(-) and <$'(•) is the branch between A" and B" . 

Expected weighted LD using one diverged reference population: Using the admixed popu- 
lation C as one reference and a population A' as the other, we have pc(-) = c^Pa{ ) + Ppb(-) 
(assuming negligible post-admixture drift), giving weights 

Sa'c(-) = Pa'(-) - atp A (-) - Ppb{-) = <x6 A ' A (-) + /35 A > B (-), 
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where 5pq denotes the allele frequency difference between populations P and Q. Arguing as 
above, the expected weighted LD is given by 

E[a(d)} = Ei2a(35(x)5(y)5 A , c (x)5 A , c (y)e- nd }. 

To complete the calculation, we compute 

E[6(-)5 A , C (-)} = aE[6(-)5 A>A (-)} + 0E[5(-)6 A , B (-)). 

For the first term, the intersection of the A—B and A' -A drift paths is the A-A" branch, so 
E[6(-)S A ' A (-)] = —F 2 (A, A") with the negative sign arising because the paths traverse this branch 
in opposite directions. For the second term, the intersection of the A—B and A'-B drift paths is 
the A"-B branch (traversed in the same direction), so E[5(-)8 A 'b(-)} = F 2 (B,A"). Combining 
these results gives equation d8]>, as desired. (Note that a slight subtlety arises now that we are using 
population C in our weights: sites x and y can exhibit admixture LD at appreciable distances, 
so 5 A 'c{x) and 5 A 'c{y) are not independent. However, only the drift since admixture is corre- 
lated, which is negligible for typical scenarios we study in which admixture occurred 200 or fewer 
generations ago.) 

Bounding mixture fractions using one reference: We now establish our claim in the main text 
that the estimator a given in equation ([12]) for the mixture fraction a is a lower bound when the 



reference population A' is diverged from A. Equation ( [12] ) gives a correct estimate when A' = A 
but becomes an approximation when there is genetic drift between A and A' or between C and C 
(For accuracy, in this section we relax our usual assumption that C = C .) 



Rearranging equation ( |12| ), we have by definition 

2a do 



1-a" F 2 (A\C<y 
From equation @, the amplitude d is in truth given by 

a = 2a(3(-aF 2 (A, A") + 0F 2 {B, A ")) 2 e~ n/2N * 
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(13) 



where we have included the post-admixture drift multiplier e n / 2JVe from the C-C branch. It 
follows that 

a 2a _ n/2Ne „ 2a 



e -n/^v e < _ (14) 



(-aPF 2 (A,A") + P 2 F 2 (B,A")) 2 f5 I- a 

We claim that F 2 (A', C' f > (-af3F 2 (A, A") + (3 2 F 2 (B, A")) 2 , in which case combining ([13]) 



and ( 14) gives a/(l — a) < a/(l — a) and hence a < a. Indeed, we have 



F 2 (A',C) > F 2 (A",C) 

= a 2 F 2 (A,A") + p 2 F 2 (B,A") 
> -a/3F 2 (A,A") + /3 2 F 2 (B,A"). 

Squaring both sides appears to give our claim, but we must be careful because it is possible for 
the final expression to be negative. We will assume A' is closer to A than B, i.e., F 2 (A,A") < 
F 2 (B, A"). Then, if a < (3, the final expression is clearly positive. If a > (3, we have a 2 F 2 (A, A") > 
a(3F 2 (A,A") and so 

F 2 (A', C) > a 2 F 2 (A, A") + (3 2 F 2 (B, A") > a(3F 2 (A, A") — (3 2 F 2 (B, A"). 

Thus, squaring the inequality is valid in either case, establishing our bound. From the above 
we also see that the accuracy of the bound depends on the sizes of the terms that are lost in 
the approximation — aF 2 (A,A"), F 2 (A',A") and F 2 {C,C) — relative to the term that is kept, 
f3 2 F 2 (B,A"). In particular, aside from the bound being tighter the closer A' is to A, it is also 
more useful when the reference A' comes from the minor side a < 0.5. 



Affine term from assortative mating: In the above, we have assumed that population C is ho- 
mogeneously admixed; i.e., an allele in any random admixed individual from C has a fixed prob- 
ability a of having ancestry from A and (3 of having ancestry from B. In practice, many admixed 
populations experience assortative mating such that subgroups within the population have vary- 
ing amounts of each ancestry. Heterogeneous admixture among subpopulations creates LD that is 
independent of genetic distance and not broken down by recombination: intuitively, knowing the 
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value of an allele in one individual changes the prior on the ancestry proportions of that individual, 
thereby providing information about all other alleles (even those on other chromosomes). This 
phenomenon causes weighted LD curves to exhibit a nonzero horizontal asymptote, the form of 
which we now derive. 

We model assortative mating by taking a to be a random variable rather than a fixed probability, 
representing the fact that individuals from different subpopulations of C have different priors on 
their A ancestry. As before we set (3 := 1 — a and we now denote by a and (3 the population- 
wide mean ancestry proportions; thus, jj x = ap A (x) + (3p B (x). We wish to compute the expected 
diploid covariance E[z(x, y)\ which we saw in equation ([2]) splits into four terms corresponding to 
the LD between each copy of the x allele and each copy of the y allele. 

Previously, the cross-terms covpTi,!^) + cov(X 2 ,Y"i) vanished because a homogeneously 
mixed population does not exhibit inter-chromosome LD. Now, however, we have 

E[(Xi - fi x )(Y 2 - fiy) I p(A ancestry) = a] 
= E[X\ — [i x | p(A ancestry) = a] ■ E[Y 2 — \i v \ p(A ancestry) = a] 

= {ap A (x) + (3p B (x) - fx x )(ap A (y) + f3p B (y) - fi y ) 
= ((a - a)p A (x) + (J) - j3)p B (x))((a - a)p A {y) + {fi - /3)p B (y)) 
= ((a - a)p A (x) - {a- a)p B {x)){{a - a)p A (y) - (a - a)p B (y)) 
= (a — a) 2 S(x)S(y). 

It follows that 

cov(X 1; Y 2 ) = E[(a - a) 2 5(x)5(y)} = var(a)5(x)%) (15) 

and likewise for cov(X 2 , Yi). 

To compute the same-chromosome covariance terms, we split into two cases according to 
whether or not recombination has occurred between x and y since admixture. In the case that 
recombination has not occurred — i.e., the ancestry of the chromosomal region between x and y 
can be traced back as one single chunk to the time of admixture, which occurs with probability 
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e nd — the region from x to y has ancestry from A with probability a and from B with probability 
f3. Thus, 

E[(Xi — fi x )(Yi — fi y ) | no recomb, p(A ancestry) = a] 
= aE[(Xi — /i x )(Yi — fi y ) | A ancestry] + j3E[{X\ — fi x )(Yi — fi y ) \ B ancestry] 
= a(p A (x) - fi x )(p A (y) - fiy) + /3(p B (x) - fi x )(p B (y) - H y ) 

= a(/3p A (x) - (3p B (x))((3p A (y) - (3p B (y)) + 0(ap B (x) - ap a {x)){ap B {y) - ap A (y)) 
= (aft 2 + f3a 2 )5(x)5(y). 

Taking the expectation over the whole population, 

E[(Xi - fi x )(Y 1 - fiy) | no recomb] = [a]? + ^a 2 )5(x)5(y) = aP6(x)5(y) (16) 

as without assortative mating. 

In the case where there has been a recombination, the loci are independent conditioned upon 
the ancestry proportion a, as in our calculation of the cross-terms; hence, 

E[(X t - fi x ){Y x - fi y ) I recomb] = var(a)5(x)<%) (17) 

occurring with probability 1 — e~ nd . 



Combining equations ( [15] ), ( fT6[ ), and ( [17] ), we obtain 

E[z(x,y)} = E[{X - fi x ){Y - fi y )] 

= 2 var(a)5(x)8(y) + 2e- nd aP5{x)5{y) + 2(1 - e- nd )var{a)5{x)5(y) 
= (e- nd {2aP - 2 var(a)) + 4 var(a))8(x)8(y). 

Importantly, our final expression for E[z(x,y)] still factors as the product of a ^-dependent 
term — now an exponential decay plus a constant — and the allele frequency divergences S(x)S(y). 
Because it is the product 5(x)5(y) that interacts with our various weighting schemes, the formulas 
that we have derived for the weighted LD curve E[a(d)] — equations ([4]), ([6]), ([7]), and ([8]) — retain 
the same factors involving F 2 distances and change only in the replacement of 2a(3e' nd with 
e- nd {2a(3 - 2 var(a)) + 4 var(a). 
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APPENDIX 2: TESTING FOR ADMIXTURE 
Here we provide details of the weighted LD-based test for admixture we implement in ALDER. 
The test procedure is summarized in the main text; we focus here on technical aspects not given 
explicitly in Methods. 

Determining extent of LD correlation: The first step of ALDER estimates the distance to which 
LD in the test population is correlated to LD in each reference population. Such correlation sug- 
gests shared demographic history that can confound the ALD signal, so it is important to deter- 
mine the distance to which LD correlation extends and analyze weighted LD curves a(d) only for 
d greater than this threshold. Our procedure is as follows. We successively compute LD corre- 
lation for SNP pairs (x,y) within distance bins dk < \x — y\ < d k +i, where dk = kr for some 
bin resolution r (0.05 cM by default). For each SNP pair (x, y) within a bin, we estimate the LD 
(i.e., sample covariance between allele counts at x and y) in the test population and the LD in the 
reference population. We then form the correlation coefficient between the test LD estimates and 
reference LD estimates over all SNP pairs in the bin. We jackknife over chromosomes to estimate 
a standard error on the correlation, and we set our threshold after the second bin for which the 
correlation is insignificant (p > 0.05). To reduce dependence on sample size, we then repeat this 
procedure with successively increasing resolutions up to 0.1 cM and set the final threshold as the 
maximum of the cutoffs obtained. 

Determining significance of a weighted LD curve: To define a formal test for admixture based 
on weighted LD, we need to estimate the significance of an observed weighted LD curve d(d). 
This question is statistically subtle for several reasons. First, the null distribution of the curve a{d) 
is complex. Clearly the test population C should not be admixed under the null hypothesis, but 
as we have discussed, shared demography — particularly bottlenecks — can also produce weighted 
LD. We circumvent this issue by using the pre-tests described in the next section and assume that if 
the test triple (C; A', B') passes the pre-tests, then under the null hypothesis, non-admixture demo- 
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graphic events have negligible effect on weighted LD beyond the correlation threshold computed 
above. Even so, the a(d) curve still cannot be modeled as random white noise: because SNPs con- 
tribute to multiple bins, the curve typically exhibits noticeable autocorrelation. Finally, even if we 
ignore the issue of colored noise, the question of distinguishing a curve of any type — in our case, 
an exponential decay — from noise is technically subtle: the difficulty is that a singularity arises in 
the likelihood surface when the amplitude vanishes, which is precisely the hypothesis that we wish 



to test(DAV!ES 19771. 



In light of these considerations, we estimate a p-value using the following procedure, which 
we feel is well-justified despite not being entirely theoretically rigorous. We perform jackknife 
replicates of the a(d) curve computation and fitting, leaving out one chromosome in each replicate, 
and estimate a standard error for the amplitude and decay constant of the curve using the usual 
jackknife procedure. We obtain a "2-score" for the amplitude and the decay constant by dividing 
each by its estimated standard error. Finally, we take the minimum (i.e., less- significant) of these 
z-scores and convert it to a p-value assuming it comes from a standard normal; we report this 
p- value as our final significance estimate. 

Our intuition for this procedure is that checking the "z-score" of the decay constant essentially 
tells us whether or not the exponential decay is well-determined: if the a(d) curve is actually just 
noise, then the fitting of jackknife replicates should fluctuate substantially. On the other hand, if 
the a(d) curve has a stable exponential decay constant, then we have good evidence that d(d) is 
actually well-fit by an exponential — and in particular, the amplitude of the exponential is nonzero, 
meaning we are away from the singularity. In this case the technical difficulty is no longer an issue 
and the jackknife estimate of the amplitude should in fact give us a good estimate of a ,2-score 
that is approximately normal under the null. The "z-score" for the decay constant certainly is not 
normally distributed — in particular, it is always positive — but taking the minimum of these two 
scores only makes the test more conservative. 

Perhaps most importantly, we have compelling empirical evidence that our z-scores are well- 
behaved under the null. We applied our test to nine HGDP populations that neither ALDER nor the 
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3-population test identified as admixed; for each test population, we used as references all popula- 
tions with correlated LD detectable to no more than 0.5 cM. These test triples thus comprise a suite 
of approximately null tests. We computed Q-Q plots for the reported z-scores and observed that 
for z > (our region of interest), our reported z-scores follow the normal distribution reasonably 



well, generally erring slightly on the conservative side (Figure S4). These findings give strong 
evidence that our significance calculation is sufficiently accurate for practical purposes; in reality, 
model violation is likely to exert stronger effects than the approximation error in our p- values, and 
although our empirical tests cannot probe the tail behavior of our statistic, for practical purposes 
the precise values of p-values less than, say, 1CT 6 are generally inconsequential. 

Pre-test thresholds: To ensure that our test is applicable to a given triple (C; A', B'), we need 
to rule out the possibility of pathological demography producing non-admixture-related weighted 
LD (Figure [S3]). We do so by computing the weighted LD curves with weights A'-B' , A'-C, and 
B'-C and fitting an exponential to each curve. To eliminate the possibility of a shared ancestral 
bottleneck between C and one of the references, we check that the three estimated amplitudes and 
decay constants are well-determined; explicitly, we compute a jackknife-based standard error for 
each parameter and require the implied p-value for the parameter being positive to be less than 
0.05. If so, we conclude that whatever LD is present is due to admixture, not other demography, 
and we report the p-value estimate defined above for the significance of the A'-B' curve as the 
p- value of our test. 

We are aware of one demographic scenario in which the ALDER test could potentially return a 



finding of admixture when the test population is not in fact admixed. As illustrated in Figure S3 
this would occur when A' and C have experienced a shared bottleneck and C has subsequently had 
a further period of low population size. We do not believe that we have ever encountered such a 
false positive admixture signal, but to guard against it, we note that if it were to occur, the three 
decay time constants for the reference pairs A'-B' , A'-C, and B'-C would disagree. Thus, along 
with the test results, ALDER returns a warning whenever the three best-fit values of the decay 
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constant do not agree to within 25%. 



Multiple-hypothesis correction: In determining statistical significance of test results when test- 
ing a population using many pairs of references, we apply a multiple-hypothesis correction that 
takes into account the number of tests being run. Because some populations in the reference set 
may be very similar, however, the tests may not be independent. We therefore compute an effective 
number n r of distinct references by running PCA on the allele frequency matrix of the reference 
populations; we take n r to be the number of singular values required to account for 90% of the 
total variance. Finally, we apply a Bonferroni correction to the p-values from each test using the 
effective number ( n 2 r ) of reference pairs. 
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Figure 1. Dependence of date estimates and weighted LD amplitudes on fitting start point. Rows 

correspond to three test scenarios: Simulated 75% YRI / 25% CEU mixture 50 generations ago 

with Yoruba-French weights (top); Uygur with Han-French weights (middle); HapMap Maasai 

with Yoruba-French weights (bottom). The left panel of each row shows the weighted LD curve 

a(d) (blue) with best-fit exponential decay curve (red), fit starting from d = 0.5 cM. Remaining 

panels show the date estimate (middle) and amplitude (right) as a function of do. (We note that 
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our date estimates for Uygur are somewhat more recent than those in Patterson et al. (2012), most 
likely because of our direct estimate of the affine term in the weighted LD curve.) 
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Figure 2. Weighted LD curves for Mbuti using San and Yoruba as reference populations (A) and 
using Mbuti itself as one reference and several different second references (B). Analogous curves 
for Biaka (C, D) swapping the roles of Mbuti and Biaka. Genetic distances are discretized into 
bins at 0.05 cM resolution. Data for each curve are plotted and fit (A, C) starting from the 
corresponding ALDE^-computed LD correlation thresholds. Different amplitudes of 
one-reference curves (B, D) imply different phy4k6genetic positions of the references relative to 
the true mixing populations, suggesting a sketch of a putative admixture graph for Mbuti (E). 
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Figure 3. Weighted LD curves for HGDP Sardinian using Italian- Yoruba weights (A) and 
HapMap Japanese (JPT) using JPT itself as one reference and HapMap Han Chinese (CHB) as 
the second reference (B). 
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Figure 4. Weighted LD curves for Onge using Onge itself as one reference and several different 
second references. 
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TABLES 



Table 1. Results of ALDER and 3-population tests for admixture on HGDP populations. 



Both 


#LD 


#h 


Only LD 


# 


Only h 


# 


Neither 


Adygei 


205 


139 


BiakaPygmy 


81 


French 


99 


Basque 


Balochi 


123 


204 


Colombian 


5 


Han 


13 


Dai 


BantuKenya 


30 


182 


Druze 


128 


Italian 


46 


Hezhen 


BantuSouthAfrica 


27 


11 


Japanese 


1 


Orcadian 


1 


Karitiana 


Bedouin 


300 


63 


Kalash 


20 


Tujia 


8 


Lahu 


Brahui 


363 


16 


MbutiPygmy 


77 


Tuscan 


59 


Mandenka 


Burusho 


450 


377 


Melanesian 


96 






Miao 


Cambodian 


266 


158 


Pima 


489 






Naxi 


Daur 


29 


8 


San 


155 






Papuan 


Han-NChina 


1 


77 


Sardinian 


45 






She 


Hazara 


699 


593 


Yakut 


435 






Surui 


Makrani 


173 


163 










Yi 


Maya 


784 


124 










Yoruba 


Mongola 


76 


385 












Mozabite 


313 


107 












Oroqen 


68 


5 












Palestinian 


308 


64 












Pathan 


113 


348 












Russian 


158 


153 












Sindhi 


264 


366 












Tu 


22 


315 












Uygur 


428 


616 












Xibo 


101 


335 













We ran both ALDER and the 3-population test for admixture on each of the 53 HGDP populations 
using all pairs of other populations as reference^oWe group the populations according to whether 
or not each test methodology produces at least one test identifying them as admixed; for each 
population, we list the number of reference pairs with which with each method (abbreviated "LD" 



Table 2. Amplitudes and dates from weighted LD curves with Sardinian using various 
reference pairs. 



Ref 1 Ref 2 Weighted LD amplitude Date estimate 

CEU YRI 0.00003192 ± 0.00000903 48 ±10 

CHB YRI 0.00001738 ± 0.00000679 34 ± 8 

CEU CHB 0.00000873 ± 0.00000454 52 ± 21 

Data are shown from ALDER fits to weighted LD curves computed using Sardinian as the test 
population and pairs of HapMap CEU, YRI, CHB as the references. We omitted chromosome 8 
from the analysis because of anomalous long-range LD. Curves a(d) were fit for d > 1.2 cM, the 
extent of LD correlation between Sardinian and CEU computed by ALDER. 



50 



SUPPORTING INFORMATION 




Figure SI. Notational diagram of phylogeny containing admixed population and references. 
Population C is descended from an admixture between A and B to form C; populations A' and 
B' are present-day references. In practice, we assume that post- admixture drift is negligible and 
C = C. The branch points of A' and B' from the A-B lineage are marked A" and B"; note that 
in a rooted phylogeny, these need not be most recent common ancestors. 
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X"=A X"=B 
Branch position X" of reference pop 

Figure S2. Dependence of single-reference weighted LD amplitude on the reference population. 
When taking weights as allele frequency differences between the admixed population and a single 
reference population X', the weighted LD curve a(d) has expected amplitude proportional to 
(aF 2 (A, X") - (3F 2 (B, X")) 2 , where X" is the point along the A-B lineage at which the 
reference population branches. Note in particular that as X" varies from A to B, the amplitude 
traces out a parabola that starts at (3 2 F 2 (A, B) 2 , decreases to a minimum value of 0, and increases 
to a 2 F 2 {A,B) 2 . 
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A' C B' 

Figure S3. Non- admixture-related demography producing weighted LD curves. The test 
population is C and references are A' and B'\ the common ancestor of A' and C experienced a 
recent bottleneck from which C has not yet recovered, leaving long-range LD in C that is 
potentially correlated to all three possible weighting schemes (A'-B', A'-C, and B'-C). 
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Figure S4. Q-Q plots comparing ALDER ^-scores to standard normal on null examples. We show 
results from nine HGDP populations that neither ALDER nor the 3-population test found to be 
admixed. We are interested in values of z > 0; the Q-Q plots show that these values follow the 
standard normal reasonably well, tending to err on the conservative side. 



54 



Table SI. Dates of admixture estimated for simulated 75% YRI / 25% CEU mixtures. 



Ref 1 


Ref 2 


10 


20 


50 


100 


200 


Yoruba 


French 


9±1 


20±1 


49±2 


107±5 


195±9 


Yoruba 


Han 


9±1 


21 ±1 


50±2 


107±6 


191±12 


Yoruba 


Papuan 


9±1 


21 ±1 


49±3 


118±8 


223±23 


San 


French 


9±1 


20±1 


50±2 


109±4 


197±15 


San 


Han 


9±0 


21 ±1 


51 ±3 


111 ±4 


194±16 


San 


Papuan 


9±1 


21 ±1 


51 ±3 


115±6 


209±16 


Yoruba 




9±1 


21 ±1 


48±2 


107±5 


181±17 


San 




9±1 


20±2 


56±7 


1 39±22 


213±97 


French 




9±1 


20±1 


50±2 


108±3 


194±9 


Han 




9±0 


21 ±1 


52±2 


110±6 


192±17 


Papuan 




9±1 


21 ±1 


53±3 


125±8 


217±26 



We simulated scenarios in which admixture occurred 10, 20, 50, 100, or 200 generations ago and 
show results from runs of ALDER using various references. Rows in which only one reference is 
listed indicate runs using the admixed population itself as one reference. Note that standard errors 
shown are ALDER's jackknife estimates of its own error on a single simulation (not standard 
errors from averaging over multiple simulations). 
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Table S2. Dates of admixture estimated for simulated 90% YRI / 10% CEU mixtures. 



Ref 1 


Ref 2 


10 


20 


50 


100 


200 


Yoruba 


French 


10±0 


21 ±1 


50±2 


107±7 


193±19 


Yoruba 


Han 


10±0 


20±1 


51 ±2 


109±10 


220±32 


Yoruba 


Papuan 


10±0 


22±1 


53±3 


111 ±1 1 


233±65 


San 


French 


10±0 


21 ±1 


51 ±2 


112±6 


223±19 


San 


Han 


10±0 


21 ±1 


52±3 


121 ±5 


254±40 


San 


Papuan 


11±0 


23±1 


53±3 


126±8 


287±56 


Yoruba 




9±1 


20±2 


55±7 


100±27 


363±183 


San 




98±87 


56±28 


94±69 


2±0 


9±5 


French 




10±0 


21 ±1 


51 ±2 


107±5 


217±13 


Han 




1 1 ±0 


21 ±1 


52±2 


111 ±7 


234±25 


Papuan 




11±0 


23±1 


56±3 


117±8 


256±47 



We simulated scenarios in which admixture occurred 10, 20, 50, 100, or 200 generations ago and 
show results from runs of ALDER using various references. Rows in which only one reference is 
listed indicate runs using the admixed population itself as one reference. Note that standard errors 
shown are ALDER's jackknife estimates of its own error on a single simulation (not standard 
errors from averaging over multiple simulations). 
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Table S3. Amplitudes of weighted LD curves for simulated 75% YRI / 25% CEU mixtures. 



Ref 1 


Ref 2 


Expected 




10 




20 




50 




100 




200 


Yoruba 


French 


.00117348 





00113886 





00120266 





00118771 





00128303 





00120178 


Yoruba 


Han 


.00069348 





00067768 





00071728 





00071133 





00077409 





00071553 


Yoruba 


Papuan 


.00060198 





00059817 





00063078 





00059494 





00077475 





00083509 


San 


French 


.00101690 





00098098 





00102770 





00104422 





00112783 





00103681 


San 


Han 


.00057441 





00055613 





00059002 





00060394 





00066678 





00062555 


San 


Papuan 


.00049142 





00048654 





00051405 





00050258 





00058927 





00057406 


Yoruba 




.00007513 





00007673 





00008065 





00007447 





00008291 





00007067 


San 




.00003970 





00004023 





00004238 





00005001 





00006605 





00004306 


French 




.00065473 





00062625 





00065967 





00066600 





00072135 





00065649 


Han 




.00031208 





00030373 





00032398 





00033196 





00036439 





00033197 


Papuan 




.00025176 





00025596 





00027325 





00026722 





00033133 





00031424 



We simulated scenarios in which admixture occurred 10, 20, 50, 100, or 200 generations ago and 
show results from runs of ALDER using various references. Rows in which only one reference is 
listed indicate runs using the admixed population itself as one reference. Expected amplitudes 



were computed according to formulas (10) and (11). 
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Table S4. Amplitudes of weighted LD curves for simulated 90% YRI / 10% CEU mixtures. 



Ref 1 


Ref 2 


Expected 




10 




20 




50 




100 




200 


Yoruba 


French 


.00056327 





00058711 





00057909 





00054968 





00060027 





00056203 


Yoruba 


Han 


.00033287 





00035297 





00033611 





00033876 





00038110 





00045616 


Yoruba 


Papuan 


.00028895 





00030692 





00030312 





00030910 





00034328 





00042565 


San 


French 


.0004881 1 





00052217 





00051223 





00048849 





00051941 





00062467 


San 


Han 


.00027572 





00030482 





00029084 





00028906 





00033763 





00046412 


San 


Papuan 


.00023588 





00026571 





00026220 





00025429 





00030626 





00048601 


Yoruba 




.00000589 





00000625 





00000617 





00000668 





00000720 





00004396 


San 




.00000062 





00001592 





00000771 





00000994 


-0 


00000026 


-0 


00000111 


French 




.00045389 





00047325 





00047123 





00044982 





00048074 





00056590 


Han 




.00025015 





00026809 





00026055 





00026445 





00028847 





00036873 


Papuan 




.00021228 





00023116 





00023263 





00024258 





00027553 





00036597 



We simulated scenarios in which admixture occurred 10, 20, 50, 100, or 200 generations ago and 
show results from runs of ALDER using various references. Rows in which only one reference is 
listed indicate runs using the admixed population itself as one reference. Expected amplitudes 



were computed according to formulas (10) and (11). 
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Table S5. Mixture fraction lower bounds on simulated 75% YRI / 25% CEU mixtures. 



Ref 


10 


20 


50 


100 


200 


French 


24.6±0.3 


25.7±0.5 


25.7±0.7 


27.0±1.0 


25.2±1.3 


Russian 


23.8±0.3 


24.9±0.5 


24.8±0.7 


25.6±0.8 


25.3+1.0 


Sardinian 


21.3±0.3 


21.9±0.5 


22.0±0.6 


23.6±0.9 


22.3±1.1 


Kalash 


14.7±0.2 


15.5±0.4 


15.5±0.5 


16.4±0.6 


15.6±0.9 


Yoruba 


73.6±0.7 


74.8±0.4 


74.0±0.6 


76.2±1.3 


73.8±3.4 


Mandenka 


50.5±0.6 


51.2±1.0 


50.4±1.4 


54.9±2.0 


60.8±5.6 



We simulated scenarios in which admixture occurred 10, 20, 50, 100, or 200 generations ago and 
show results from runs of ALDER using various single references. The first four rows are 
European surrogates and give lower bounds on the amount of CEU ancestry (25%); the last two 
are African surrogates and give lower bounds on the amount of YRI ancestry (75%). Note that 
standard errors shown are ALDER's jackknife estimates of its own error on a single simulation 
(not standard errors from averaging over multiple simulations). 
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Table S6. Mixture fraction lower bounds on simulated 90% YRI / 10% CEU mixtures. 



Ref 


10 


20 


50 


100 


200 


French 


10.5±0.4 


10.5±0.3 


9.9±0.3 


10.6±0.4 


12.3±1.0 


Russian 


10.2±0.3 


10.0±0.3 


9.7±0.3 


10.3±0.5 


11.8±0.9 


Sardinian 


9.3±0.3 


9.2±0.3 


8.7±0.3 


9.5±0.4 


10.3±1.2 


Kalash 


7.2±0.3 


7.0±0.3 


6.8±0.2 


7.4±0.4 


8.9±0.8 


Yoruba 


89.1±1.0 


89.1 ±1.1 


90.1 ±1.5 


89.4±3.7 


98.5±2.0 


Mandenka 


18.2±2.3 


17.3±2.5 


19.5±4.8 


63.1 ±25.5 


30.7±220.4 



We simulated scenarios in which admixture occurred 10, 20, 50, 100, or 200 generations ago and 
show results from runs of ALDER using various single references. The first four rows are 
European surrogates and give lower bounds on the amount of CEU ancestry (10%); the last two 
are African surrogates and give lower bounds on the amount of YRI ancestry (90%). Note that 
standard errors shown are ALDER's jackknife estimates of its own error on a single simulation 
(not standard errors from averaging over multiple simulations). 
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Table S7. Dates of admixture estimated for simulated 75% YRI / 25% CEU mixtures. 



Yoruba-French references 


Samples 


10 gen 


20 gen 


50 gen 


100 gen 


200 gen 


5 


12±2 


18±2 


55±3 


103±7 


258=b24 


10 


10±1 


19=L2 


50±2 


105±7 


236=b24 


20 


10±1 


20±1 


52±2 


104±5 


223=b16 


50 


9±0 


20±1 


52=b1 


96±2 


186±10 


100 


10±0 


20±0 


52±1 


101 ±2 


210=b9 


San-Han references 


Samples 


10 gen 


20 gen 


50 gen 


100 gen 


200 gen 


5 


12±2 


18±2 


58±5 


107±11 


283=b73 


10 


10±1 


19±2 


54±3 


1 14=1=8 


219±64 


20 


10±1 


21 ±1 


55±2 


1 15=1=6 


219=b46 


50 


9±0 


21 ±1 


54±1 


107=b5 


213=b20 


100 


9±0 


21 ±1 


53±1 


105=b5 


216=b13 



We simulated scenarios in which admixture occurred 10, 20, 50, 100, or 200 generations ago and 
show results from runs of ALDER using varying numbers of admixed samples. Note that standard 
errors shown are ALDER's jackknife estimates of its own error on a single simulation (not 
standard errors from averaging over multiple simulations). 
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Table S8. Dates of admixture estimated for simulated 90% YRI / 10% CEU mixtures. 



Yoruba-French references 


Samples 


10 gen 


20 gen 


50 gen 


100 gen 


200 gen 


5 


11 ±2 


21 ±2 


52±6 


101=1=17 


253±42 


10 


11±1 


19±1 


48±4 


94±8 


241 ±46 


20 


11±1 


21 ±1 


48±3 


102±8 


209±30 


50 


11±0 


21 ±1 


48±2 


98±5 


202±21 


100 


10±0 


20±1 


50±1 


99±4 


185±15 


San-Han references 


Samples 


10 gen 


20 gen 


50 gen 


100 gen 


200 gen 


5 


14±2 


22±3 


63±8 


110±30 


335±91 


10 


12±1 


20±2 


54±4 


1 1 0=b15 


265±55 


20 


12±1 


21 ±1 


52±4 


131±15 


234±33 


50 


1 1 ±0 


20±1 


53±4 


122±8 


221 ±23 


100 


11±0 


20±0 


53±3 


109±5 


219±10 



We simulated scenarios in which admixture occurred 10, 20, 50, 100, or 200 generations ago and 
show results from runs of ALDER using varying numbers of admixed samples. Note that standard 
errors shown are ALDER's jackknife estimates of its own error on a single simulation (not 
standard errors from averaging over multiple simulations). 
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Table S9. Effect of SNP ascertainment on date estimates. 



Mixed pop 


Ref 1 


Ref 2 


French asc 


Han asc 


San asc 


Yoruba asc 


Burusho 


French 


Han 


47±12 


51±13 


56±10 


41 ±10 


Uygur 


French 


Han 


15±2 


14±2 


13±2 


16±2 


Hazara 


French 


Han 


22±2 


22±3 


23±2 


22±3 


Melanesian 


Dai 


Papuan 


93±24 


62±15 


76±13 


70±18 


Bedouin 


French 


Yoruba 


27±3 


23±3 


23±3 


24±3 


MbutiPygmy 


San 


Yoruba 


33±12 


33±6 


41±14 


30±8 


BiakaPygmy 


San 


Yoruba 


39±6 


50±14 


35±6 


36±7 



We compared dates of admixture estimated by ALDER on a variety of test triples from the HGDP 
using SNPs ascertained as heterozygous in full genome sequences of one French, Han, San, and 



Yoruba individual (Panels 1, 2, 4, and 5 of the Affymetrix Human Origins Array (PATTERSON 



et al. 2012)). Standard errors are from a jackknife over the 22 autosomes. 
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File SI. Unbiased poly ache estimator for weighted LD using the admixed population itself 
as one reference. 



in[28]:= MomentConvert [ 

MomentConvert[CentralMoment[{l, 2}] (Moment [ { 1, 0} ] - pAx) (Moment[{0, 2}] - pAy) , 
"UnbiasedSampleEstlmator"] , "PowerSymmetrlcPolynomial"] II TraditionalForm 

Out[28]//TraditionalForm= 

pAx pAy 5i, 5 ,i pAx pAy 5 M (S (2) + So) pAx S 1>0 S ,i 2 pAx 5 U S ,i (2 S <2) + S <3) ) 

+ 5 So< 2 ' ~ + 5 < 3 ' S (2) V 3) 

pAx%>Si,o pAx5 1 , 2 (2 5o <2) + 5 <3) ) pAy5 1>0 2 5o,i pAyS 2 , So,i pAy5i, 5 u (2 5 (2) + S (3) ) 

5 (3) 5o (2) 5o <3) + So^ 5o (2) 5 (3) + 

P Ay5 2 , 1 (2 5o <2) + 5 <3) ) S 1>0 2 5 ,i 2 5 2 , 5 ,i 2 S h0 S u S ,i (4S (3) + S (4) ) 5 2>1 5 0>1 (-4 5 <3) - 5 <4) ) 

5o (2) 5 <3) 5 (4) + 5 <4) 5o (3) 5o <4) + S <3) V 4) + 

50,2 V S u 2 (-2So (3 >-S (4 >) S U0 S U2 {-4S W-S W) S Q , 2 S 2 , Q 2 5 2>2 (3 5 <3) + 5 <4) ) 

H + h ■ 



5 (4) 5 < 3 '5o (4) 5o (3) S (4) S (4) V 3) So' 



(4) 



Mathematica code and output are shown for computing the polyache statistic that estimates the 
one-reference weighted LD, E[(X — fi x )(Y — A^X/^x — PA(x))(fi y — pa(i/))], where Pa(-) are 
allele frequencies of the single reference population and /jl x and fi y denote allele frequencies of the 
admixed population. In the above, := m(m — 1) • • • (to — k + 1) and S TjS := Y^T=i -^lYf, 
where to is the number of admixed samples and % ranges over the admixed individuals, which have 
allele counts Xi and F; at sites x and y. 
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File S2. FFT computation of weighted LD. 



In this note we describe how to compute weighted LD (aggregated over distance bins) in time 



where m is the number of admixed individuals, S is the number of SNPs, and B is the number 
of bins needed to span the chromosomes. In contrast, the direct method of computing pairwise 
LD for each individual SNP pair requires 0(mS 2 ) time. In practice our approach offers speedups 
of over lOOOx on typical data sets. We further describe a similar algorithm for computing the 
single -reference weighted LD polyache statistic that runs in time 



with the slight trade-off of ignoring SNPs with missing data. 

Our method consists of three key steps: (1) split and factorize the weighted LD product; (2) 
group factored terms by bin; and (3) apply fast Fourier transform (FFT) convolution. As a special 
case of this approach, the first two ideas alone allow us to efficiently compute the affine term (i.e., 
horizontal asymptote) of the weighted LD curve using inter-chromosome SNP pairs. 



We first establish notation. Say we have anSxm genotype array {c Xti } from an admixed popula- 
tion. Assume for now that there are no missing values, i.e., 



for x indexing SNPs by position on a genetic map and % = 1, . . . , m indexing individuals. Given 
a set of weights w x , one per SNP, we wish to compute weighted LD of SNP pairs aggregated by 
inter-SNP distance d: 



0(m(S + B log B)), 



0(m 2 (S + B log B)) 



TWO-REFERENCE WEIGHTED LD 



c x ,i e {0,1,2} 




v 



\x—y\md 
x<y 
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where D 2 is the sample covariance between genotypes at x and y, the diploid analog of the usual 
LD measure D: 



-j^ hi, -j^ iii, lib 

D 2 (x, y) := E Cx , iC yt ~ m ( m _ f/j E E °y<'j 



i=i i=i j=i 

1 ^ 1 

i=i v ' 



(1) 



where we have defined 

m 
i=l 

Substituting for D 2 (x, y), we have 

1 / 1 m 1 

|x-J/|RJd \ 1=1 



S X Sy I W X Wy 



"I) 

y^-^ y Cx ^-c y ^] — -4 — - y s ^- w (2) 

^ 2 m - 1) ' y ' y 2m m - 1 ^ y w 

i=l V 7 |x-y|wd / V 1 |x-y|wd 

We have thus rewritten as a linear combination of m + 1 terms of the form 

E /(*)/(!/)■ 

|x—J/| Rid 

(The sum over % consists of m such terms, and the final term accounts for one more.) 
In general, sums of the form 

E MM 

\x-y\fad 

can be efficiently computed by convolution if we first discretize the genetic map on which the 
SNP positions x and y lie. For notational convenience, choose the distance scale such that a unit 
distance corresponds to the desired bin resolution. We will compute 

E f(*)9(v)- (3) 

\_x\-[y\=d 

That is, we divide the chromosome into bins of unit distance and aggregate terms f(x)g(y) by the 
distance between the bin centers of x and y. Note that this procedure does not produce exactly the 
same result as first subtracting the genetic positions and then binning by \x — y\ : with our approach, 
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pairs (x, y) that map to a given bin can have actual distances that are off by as much as one full bin 
width, versus half a bin width with the subtract-then-bin approach. However, we can compensate 
simply by doubling the bin resolution. 
To compute expression ([3]), we write 

E f( x )a(y) = EE E fWaiv) 

\_x\-\_y\=d b=0 [x\=b[y\=b~d 

= E[ E/^)| ( E 9(y)). (4) 

b=0 \[x\=b J \ly\=b-d J 



Writing 



expression Q becomes 



F(b) := E /(*)> G ( b ) ■= E 9^) 

[x\=b [x\=b 



E F(b)G(b — d) = (F* G)(d), 

6=0 

a cross-correlation of binned f(x) and terms. 

Computationally, binning / and g to form F and G? takes O(S) time, after which the cross- 
correlation can be performed in 0(B log B) time with a fast Fourier transform. The full computa- 
tion of the m + 1 convolutions in equation thus takes 0(m(S + B log 5)) time. In practice we 
often have B log B < S, in which case the computation is linear in the data size mS. 

One additional detail is that we usually want to compute the average rather than the sum of 
the weighted LD contributions of the SNP pairs in each bin; this requires normalizing by the 
number of pairs (x, y) that map to each bin, which can be computed in an analogous manner 
with one more convolution (setting / = 1, g = 1). Finally, we note that our factorization and 
binning approach immediately extends to computing weighted LD on inter-chromosome SNP pairs 
(by putting all SNPs in a chromosome in the same bin), which allows robust estimation of the 
horizontal asymptote of the weighted LD curve. 



Missing Data The calculations above assumed that the genotype array contained no missing data, 
but in practice a fraction of the genotype values may be missing. The straightforward non-FFT 
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computation has no difficulty handling missing data, as each pairwise LD term D 2 (x,y) can be 
calculated as a sample covariance over just the individuals successfully genotyped at both x and 
y. Our algebraic manipulation runs into trouble, however, because if k individuals have a missing 
value at either x or y, then the sample covariance contains denominators of the form \/{m — k — \) 
and 1/ (to — k)(m — k — 1) — and k varies depending on x and y. 

One way to get around this problem is simply to restrict the analysis to sites with no missing 
values at the cost of slightly reduced power. If a fraction p of the SNPs contain at least one missing 
value, this workaround reduces the number of SNP pairs available to (1 — p) 2 of the total, which 
is probably already acceptable in practice. 

We can do better, however: in fact, with a little more algebra (but no additional computational 
complexity), we can include all pairs of sites (x, y) for which at least one of the SNPs x, y has no 
missing values, bringing our coverage up to 1 — p 2 . 

We will need slightly more notation. Adopting eigenstrat format, we now let our genotype 
array consist of values 

c X:i G {0,1,2,9} 

where 9 indicates a missing value. (Thus, {c Xji } is exactly the data that would be contained in a 
. geno file.) For convenience, we write 



c (0) •= 



Cx,i if c x ,i e {0, 1,2} 
otherwise. 



That is, c£J replaces missing values with Os. As before we set 

TO 

s ■= V c =Vc (0) 

to be the sum of all non-missing values at x, which also equals the sum of all cfj because the 
missing values have been O-replaced. Finally, define 

kx ■ ^{^ • Cx,i 9} 
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to be the number of missing values at site x. 

We now wish to compute aggregated weighted LD over pairs (x, y) for which at least one of 
k x and k y is 0. Being careful not to double-count, we have: 



R(d) := ^ D 2 (x,y)w x w y 

\x-y\xsd 
x<y 
k x =0 or k y =0 

= - ^2 D 2 (x,y)w x w y + ^2 D 2 (x,y)w x w y 

\x-y\ttd \x-y\ad 
k x =0 and k y =0 k x =0 and k y ^0 

= E iTi\k~y = 0] D2iX ' y)W ^ ^ 

\x-y\vd 1 y 1 

where the shorthand /[•] denotes a {0, l}-indicator. 

Now, for a pair of sites (x, y) where x has no missing values and y has k y missing values, 

1 m 1 / m \ 

D 2 {x, y) = V c x ^c y °l - — \s x - Y] I[cy i = 9]c Xii s y . (6) 

m — k y — 1 y ' [m — ky)(m — ky — 1) \ J 

Indeed, we claim the above equation is actually just a rewriting of the standard covariance formula 
([T]), appropriately modified now that the covariance is over m — k y values rather than m: 

• In the sum Y^T=i c x,i c y % missing values in y have been 0-replaced, so those terms vanish 
and the sum effectively consists of the desired m — k y products c X;i c yii . 

• Similarly, s y is equal to the sum of the m — k y non-missing c y ^ values. 

• Finally, s x — YlT=i I\ c y,i = §} c x,i represents the sum of c X:i over individuals i successfully 
genotyped at y, written as the sum s x over all m individuals minus a correction. 
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Substituting ([6]) into expression ([5]) for R(d) and rearranging, we have 

R(d) = V I[kx = 0] ( 1 Vc C (0) 

1 / m 

j=l 



(m — ky){m — k y — 1) 
^J I[kx = ° ]SxWx) ' ((l + /^ = 0])(m-\)(m-A;,-l)) ' 



\x-y\txd 

The key point is that we once again have a sum of m+1 convolutions of the form Y^\x-y\md f{ x )9(v) 
and thus can compute them efficiently as before. 

ONE-REFERENCE WEIGHTED LD 
When computing weighted LD using the admixed population itself as a reference with one other 



reference population, a polyache statistic must be used to obtain an unbiased estimator (File SI ). 
The form of the polyache causes complications in our algebraic manipulation; however, if we 
restrict our attention to SNPs with no missing data, the computation can still be broken into con- 
volutions quite naturally, albeit now requiring 0(m 2 ) FFTs rather than 0(m). 

As in the two-reference case, the key idea is to split and factorize the weighted LD formula. 
We treat the terms in the polyache separately and observe that each term takes the form of a 
constant factor multiplied by a product of sub-terms of the form S r , a , Pa(x), or Pa(v)- We can use 
convolution to aggregate the contributions of such a term if we can factor it as a product of two 
pieces, one depending only on x and the other only on y. Doing so is easy for some terms, namely 
those that involve only pa(x), Pa(v), S rfi , and S 0iS , as the latter two sums depend only on x and y, 
respectively. 

The terms involving S TjS with both r and s nonzero are more difficult to deal with but can 
be written as convolutions by further subdividing them. In fact, we already encountered Si i = 
YmLi Cx,iC y ,i in our two-reference weighted LD computation: the trick there was to split the sum 
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into its m components, one per admixed individual, each of which could then be factored into 
x-dependent and y-dependent parts and aggregated via convolution. 

Exactly the same decomposition works for all of the polyache terms except the one involving 
S 2 ^. For this term, we write 

m m mm 

Si i = ^ ] Cx,iCy,i ^ ^ Cx,jCy,j — ^ ] ^ ] c x,i c x,j ' c y,i c y,ji 
i=l j=l i=l j=l 

from which we see that splitting the squared sum into m 2 summands allows us to split the x- and 
y-dependence as desired. The upshot is that at the expense of 0(m 2 ) FFTs (and restricting our 
analysis to SNPs without missing data), we can also accelerate the one-reference weighted LD 
computation. 
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